-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improves the deserialization speed by reducing object initialization overhead #284
Conversation
High Level Overview
|
At a high-level I think we need to commit to dropping some of the existing factory methods. I know it's arguably part of our API, so we'd need to rev at least minor, but probably major version, but I think it's worth it. |
Thanks for doing the work to get to this place! I'm super excited to see the benefits of this! Let's not be afraid to be bold in breaking API compatibility to do so. |
amazon/ion/ion_py_objects.py
Outdated
from amazon.ion.symbols import SymbolToken | ||
|
||
|
||
class IonPyNull_new(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may discover a reasonable "why" later, but barring that, ditch the _new
suffix please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I will keep this comment unsolved until renaming them.
I named them object
, object_new
and object_original
so that we can separate them in the existing tests for easier debug and performance comparisons.
EDIT: Once we approve other comments within this PR, I'll rename these classes to their appropriate names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm this is still the plan. [refers to resolved conversation about removing _new classes before actually merging]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll rename it once we address other comments. I'm using these class names for easier before/after performance comparison...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed the updated question. Would you also refer to removing the old IonPyObject? yeah I think we can completely switch to the _new IonPyObject and rename them to the official IonPyObject
. This won't cause breaking changes. (I'm not sure if the pure python implementation still need old IonPyObjects. If they are, I'll leave a comment for further discussion otherwise I'll safely deprecate these old IonPyObjects, maybe a separate sub-PR targeting this one for all the clean up work)
I left draft comments and tried to send all of them at once but seems the comment doesn't cache draft once I click resolve conversation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest commit b6fe2df renames IonPyObject_new
to IonPyObject
, we can safely remove IonPyObject_original
now.
Okay for this PR,
|
- Refactors some class attributes and renames tests and methods - Removes _copy() - Refactors ionPyDict's constrcutor signature - Changes ion_type to be a class attribute
amazon/ion/equivalence.py
Outdated
@@ -26,7 +26,7 @@ | |||
from amazon.ion.symbols import SymbolToken | |||
|
|||
|
|||
def isinstance_of_ion_nature(obj): | |||
def isinstance_of_ion_nature_or_new_ion_py_objects(obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method will be deprecated with other old methods later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right... But what will replace it? I ask because I think that would be either just trying to get the iontype
and annotations
or testing if the two objects hasattr
them. So why not do that now and avoid the series of or
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, refactored it to check if both ion_type and ion_annotations attributes exist.
amazon/ion/ion_py_objects.py
Outdated
def __init__(self, ion_type, value, annotations=()): | ||
def __init__(self, ion_type=IonType.LIST, value=None, annotations=()): | ||
if value is None: | ||
value = [] | ||
super().__init__(value) | ||
self.ion_annotations = annotations | ||
# it's possible to be Sexp | ||
self.ion_type = ion_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the constructor still can control ion_type as it's possible to be Sexp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we call this IonPySequence or something, then? Or does this just need to conform to the existing naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, It won't break any current use case since we can call it any name we want (E.g., we are calling it an unofficial name with the _new
suffix). Since the original list and Sexp are called IonPyList, this might affect some write use case (Not sure, maybe for the read use case, people just need to use IonPyList as a native list so they don't need to worry about its class name?).
Or maybe we can redeclare it to ionPyList in simple_types.py
first, then fully deprecate it when we make breaking changes in the future.
ionPyList = ionPySequence
amazon/ion/equivalence.py
Outdated
@@ -26,7 +26,7 @@ | |||
from amazon.ion.symbols import SymbolToken | |||
|
|||
|
|||
def isinstance_of_ion_nature(obj): | |||
def isinstance_of_ion_nature_or_new_ion_py_objects(obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right... But what will replace it? I ask because I think that would be either just trying to get the iontype
and annotations
or testing if the two objects hasattr
them. So why not do that now and avoid the series of or
s?
amazon/ion/ion_py_objects.py
Outdated
from amazon.ion.symbols import SymbolToken | ||
|
||
|
||
class IonPyNull_new(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm this is still the plan. [refers to resolved conversation about removing _new classes before actually merging]
amazon/ion/ion_py_objects.py
Outdated
ion_type = IonType.NULL | ||
|
||
def __init__(self, ion_type=IonType.NULL, value=None, annotations=()): | ||
self.ion_type = ion_type # TODO initialized to NULL type first, what's the real type? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the action here? I would say it's OK as it is... or suggest that for IonPyNull
we could avoid the class attribute. But I don't understand if this points to a deeper issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
amazon/ion/ion_py_objects.py
Outdated
if isinstance(self, IonPyNull_new) or self.ion_type.is_container: | ||
value = None | ||
|
||
if in_struct: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we could avoid the logic and simplify the parameters by passing only passing field_name
if it were in a struct. realize that may be a breaking change, although I'd like to understand what behavior is expected when passing a field_name with in_struct=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't spend too much effort on the to_event
API since it's mainly used by the pure python implementation, but it seems that we do not have to maintain in_struct
. It requires more investigation to see if this will cause breaking changes (at least some low-level internal methods need this parameter but the top-level dump()
doesn't).
amazon/ion/ion_py_objects.py
Outdated
|
||
def to_event(self, event_type, field_name=None, in_struct=False, depth=None): | ||
value = self | ||
if isinstance(self, IonPyNull_new) or self.ion_type.is_container: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would either of these be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to_event
was same as the original factory method, which was used by both container and non-container Ion types so it had this condition check. I refactored them.
amazon/ion/ion_py_objects.py
Outdated
self.__store = {} | ||
if value is not None: | ||
for key, value in iter(value.items()): | ||
self.__store[key] = [value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't this overwrite when value.items()
has multiple mappings for a field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter value
of the original multimap constructor is limited to Python's native dictionary. I added logic to support multi-value Map.
amazon/ion/ion_py_objects.py
Outdated
return self.__store[key][len(self.__store[key]) - 1] # Return only one in order not to break clients | ||
|
||
def __delitem__(self, key): | ||
del self.__store[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the contract for multiply-mapped fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It deletes everything map to the given key currently.
amazon/ion/ion_py_objects.py
Outdated
def items(self): | ||
output = [] | ||
for k, v in self.iteritems(): | ||
output.append((k, v)) | ||
return output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you refer to items()
vs.iteritems()
?
Python 3 has deprecated iteritems
for native dict. I'll see if we can delete it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for items()
:
The python 3 dict.items()
returns a set-like object providing a view on the dictionary's items.
Our IonPydict.item()
still keeps the python 2 behavior that returns a list (and as a result, have to maintain a list).
This Python 2/3 and dict/IonPyDict behavior difference need to be fixed or documented. I'll create a Github issue and see if we still need this method and why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created GH issue - #288
amazon/ion/ion_py_objects.py
Outdated
annotations=self.ion_annotations, depth=depth) | ||
|
||
|
||
class IonPyDict_new(MutableMapping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is existing pydoc about the contract for setting, deleting and accessing fields with multiple mappings please ensure it is moved. If not, now is a great time to add it! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I'm working on adding a method documents according to the current behavior.
A new commit is available below. An overall comment: This PR tried to incorporate the performance improvement into the code with the least changes in case of any breaking change. Some methods are copied from the original IonPyObject but it's time to refactor/deprecate some of them! Also Added some documents. Clean up work:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job on these performance improvements! Just minor comments from me.
amazon/ion/ioncmodule.c
Outdated
_ionpyint_fromvalue = _ionpyint_cls; | ||
_ionpynull_fromvalue = _ionpynull_cls; | ||
_ionpyfloat_fromvalue = _ionpyfloat_cls; | ||
_ionpydecimal_fromvalue = _ionpydecimal_cls; | ||
_ionpytext_fromvalue = _ionpytext_cls; | ||
_ionpylist_fromvalue = _ionpylist_cls; | ||
_ionpydict_fromvalue = _ionpydict_cls; | ||
_ionpybool_fromvalue = _ionpybool_cls; | ||
_ionpysymbol_fromvalue = _ionpysymbol_cls; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the lines like
_ionpynull_fromvalue = PyObject_GetAttrString(_ionpynull_cls, "from_value");
above, which these seem to overwrite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I added a new action item 2 to the next step list above.
tests/test_simpleion.py
Outdated
# generate_containers_binary(_SIMPLE_CONTAINER_MAP), | ||
# generate_annotated_values_binary(SIMPLE_SCALARS_MAP_BINARY, _SIMPLE_CONTAINER_MAP), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these still be commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
amazon/ion/ion_py_objects.py
Outdated
__name__ = 'IonPyNull_new' | ||
__qualname__ = 'IonPyNull_new' | ||
|
||
def __init__(self, ion_type=IonType.NULL, value=None, annotations=()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we drop the value
parameter? Or is it needed for consistency with the other __init__
implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to maintain the constructors' signatures identical to from_value
's, so that we can perform an in-place replacement. This approach can also help us quantify the overhead caused by from_value
. For example, switching from from_value
to their constructors introduces an 8-12% improvement in performance.
value
is not necessarily required after the PR is merged. I added a clean-up work 3 above to drop this parameter.
amazon/ion/ion_py_objects.py
Outdated
def __init__(self, ion_type, value, annotations=()): | ||
def __init__(self, ion_type=IonType.LIST, value=None, annotations=()): | ||
if value is None: | ||
value = [] | ||
super().__init__(value) | ||
self.ion_annotations = annotations | ||
# it's possible to be Sexp | ||
self.ion_type = ion_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we call this IonPySequence or something, then? Or does this just need to conform to the existing naming?
* Deprecated `ion_nature` * Migrated objects construction from `from_value()` to their constructors in C extension * Added IonPytimestamp and IonPyBytes * Replaced all the original IonPyObject and factory methods with the new IonPyObjects. * Worked on the IonPyTimestamp signature to make it compatible with the amazon.core.Timestamp * Updated the unit tests to accommodate the changes.
The clean up commit did below:
|
Description:
Publish the PR for feedback but will keep it in draft mode until we are confident to merge this change. More changes and tests come soon.
This PR improved the library's performance by 40-50% through:
(1) migrating to new ionPyObjects without inheritance (+~20%),
(2) switching the way of object initialization from from_value() to constructor directly (+~10%), and
(3) optimizing Ion struct's underlying storage implementation (+~20%)
Unit tests
This PR also fixed more than 2k unit test failures due to the changes of IonPyObjects. Most of failures are because of (1) ion_nature inheritance, (2) from_value invocation, and (3) methods signature differences.
Detail:
In detail, create an IonPyObject requires a few steps which causes a lot of overhead. The overheads mainly come from (1) Ion_Nature inheritance, (2) from_value() method invocation overhead and (3) methods refactor overhead. In addition, Ion Struct, the Ion multiMap, produces a lot of overhead due to use of high level abstract classes such as OrderedDict, mutableMapping and MutableSequence. In the Ion Python C extension, we always create an empty Ion Struct first, and recursively call its "add_item" API to add new values. Every time inserting a value into the Struct, the code will check if this is the first value occurs for this key. If it's the first value for the key, it will create a new MultiSequence for the key, otherwise it will just insert the value into the list. Creating new mutableValues is extremely expensive when deserializing large Ion files.
So this PR first removed Ion_Nature inheritance and the reusable help functions for creating IonPyObject classes.
Then, all new IonPyObjects has its own constructor. The PR next switched the way of IonPyObjects initialization from
from_value()
to their constructors directly to avoid unnecessary method invocation overhead.Last, the PR optimized the multiMap implementation by using native list instead of MutableSequence. In addition, since the Ion Struct is an unordered collection, we uses a Python dict directly instead of previous OrderedDict.
Benchmarking
The performance gain depends on various factors such as data model and size. Below are three sample data I used for benchmarking:
Next Step
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.